Coding for DS and DM
R coding module

Lecture 2

Andrea Cappozzo
andrea.cappozzo@unimi.it
AndreaCappozzo
andreacappozzo.rbind.io

Quickstart: Installing and loading libraries

  • Example:
install.packages("mclust", dependencies = TRUE)
library("mclust")
data("diabetes")
head(diabetes)
   class glucose insulin sspg
1 Normal      80     356  124
2 Normal      97     289  117
3 Normal     105     319  143
4 Normal      90     356  199
5 Normal      90     323  240
6 Normal      86     381  157

Basic data types (1)

Numeric

  • Decimal values are called “numeric” in R.
  • It is the default computational data type.
  • If a decimal value is assigned to a variable x, x will be of numeric type:
x <- 14.33 ### assign a decimal value
class(x)  ### class of x
[1] "numeric"
typeof(x)  ### type of R object of x
[1] "double"

Basic data types (2)

General numeric

  • Even if an integer is assigned to a variable x, it is still numeric:
x <- 10 ### assign an integer value
is.integer(x) ### is x an integer?
[1] FALSE

Basic data types (3)

Integer

  • To create an integer variable, as.integer() can be invoked:
x <- as.integer(11) ### assign an integer data type
is.integer(x)       ### is x an integer?
[1] TRUE
  • Integers can also be declared by appending an L suffix:
y <- 22L       ### assign an integer data type
is.integer(y)  ### is y an integer?
[1] TRUE
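
The distinction between double and integer storage can be checked directly. The following is a minimal sketch; the variable names a and b are illustrative and do not come from the slides.

```r
# Illustrative values: a is stored as a double, b as an integer.
a <- 10
b <- 10L

typeof(a)        # "double"
typeof(b)        # "integer"
is.numeric(b)    # TRUE: integers also count as numeric
typeof(b * 2L)   # "integer": integer arithmetic stays integer
typeof(b / 2L)   # "double": division always returns a double
a == b           # TRUE: the values compare equal despite different storage
```

In everyday work the default double type is usually sufficient; integers mostly matter for indexing and memory use.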

Basic data types (4)

Complex numbers

z <- 5+2i  ### assign a complex number
typeof(z)   ### type of z
[1] "complex"
  • Basic functions supporting complex arithmetic are:
Re(z)    ### real part
[1] 5
Im(z)    ### imaginary part
[1] 2
Mod(z)   ### modulus
[1] 5.385165
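
A few more base R functions for complex numbers may be useful. This is a small sketch; the second number w is a hypothetical value chosen so that its modulus is easy to verify by hand.

```r
z <- 5 + 2i
w <- 3 + 4i             # hypothetical second complex number

Conj(z)                 # complex conjugate: 5-2i
Mod(w)                  # modulus: sqrt(3^2 + 4^2) = 5
z * Conj(z)             # squared modulus as a complex number: 29+0i
sqrt(as.complex(-1))    # 0+1i, whereas sqrt(-1) gives NaN with a warning
```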

Basic data types (5)

Logical

  • A logical value is often created via comparison between variables:
x <- 2 > 1    ### is 2 greater than 1?
x
[1] TRUE

Basic data types (6)

Logical

  • Standard logical operations are & (and), | (or), and ! (not):
u <- TRUE
v <- FALSE
u & v
[1] FALSE
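
The remaining operators work the same way and are vectorised. A short sketch follows; the vectors a and b are illustrative names, not part of the slides.

```r
u <- TRUE
v <- FALSE

or_uv  <- u | v        # TRUE
not_u  <- !u           # FALSE
xor_uv <- xor(u, v)    # TRUE: exactly one of the two is TRUE

# Logical operators act element by element on vectors
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)
a & b                  # TRUE FALSE FALSE
sum(a)                 # 2: TRUE counts as 1, FALSE as 0
```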

Basic data types (7)

Character (string)

  • A character object is used to represent string values in R. Two character values can be concatenated with the paste function:
address <- 'Via'
domain <- 'Conservatorio'
paste(address, domain, sep = ' ')
[1] "Via Conservatorio"
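
As a brief sketch of related options, paste0() concatenates without a separator, and the collapse argument joins the elements of a single vector. The strings used here are purely illustrative.

```r
no_sep    <- paste0("Via", "Conservatorio")           # "ViaConservatorio"
rooms     <- paste("Room", 1:3)                       # "Room 1" "Room 2" "Room 3"
collapsed <- paste(c("a", "b", "c"), collapse = "-")  # "a-b-c"
```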

Basic data types (8)

Character (string)

  • To substitute terms in a string use sub():
my_str <- "Via Conservatorio"
sub("Via", "Piazza", my_str)
[1] "Piazza Conservatorio"
  • More functions for string manipulation can be found in the R documentation using ?sub.
  • A very convenient package to work with strings is stringr
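
One detail worth seeing in code: sub() replaces only the first match, while gsub() replaces all of them. The sketch below uses base R only; the replacement pattern is an arbitrary example.

```r
my_str <- "Via Conservatorio"

first_only <- sub("o", "0", my_str)    # "Via C0nservatorio"
all_match  <- gsub("o", "0", my_str)   # "Via C0nservat0ri0"
upper      <- toupper(my_str)          # "VIA CONSERVATORIO"
n_chars    <- nchar(my_str)            # 17
```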

Basic data structures (1)

Vectors

  • The basic data structure in R is the vector. Vectors are usually created with the c() function, short for “concatenate”:
c(1,2,4,8,16,32)
[1]  1  2  4  8 16 32
c("Italy","Spain","France","UK","Ireland","Belgium")
[1] "Italy"   "Spain"   "France"  "UK"      "Ireland" "Belgium"

Basic data structures (2)

Vectors

  • Vectors can only contain elements of a single data type. If elements of different types are combined, R coerces them to a common type:
c(FALSE,1,"2")
[1] "FALSE" "1"     "2"    
  • In this case FALSE and 1 are converted to character strings.
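
Coercion follows a fixed order, from logical to integer to double to character. A minimal sketch with illustrative values:

```r
t1 <- typeof(c(TRUE, 2L))     # "integer": TRUE becomes 1L
t2 <- typeof(c(2L, 3.5))      # "double"
t3 <- typeof(c(3.5, "a"))     # "character"

# Explicit conversion goes in the other direction
nums <- as.numeric(c("1", "2"))   # 1 2
```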

Basic data structures (3)

Named vectors

  • These are vectors with attached labels:
c('Tottenham' = 14, 'Aston Villa' = 12, 'Brentford' = 6)
  Tottenham Aston Villa   Brentford 
         14          12           6 
  • You can also use names():
x <- c(14,12,6)  ### vector
n <- c('Tottenham','Aston Villa','Brentford')    ### vector of names
names(x) <- n    ### assigning names
x
  Tottenham Aston Villa   Brentford 
         14          12           6 

Basic data structures (4)

Matrices

  • In R, a matrix is a collection of elements of the same data type arranged in a two-dimensional rectangular layout. Matrices are usually created with the matrix() function:
matrix(data = c(1,2,3,5,8,13), ### the data elements (First Fibonacci numbers)
       ncol = 3,              ### number of columns
       nrow = 2,              ### number of rows
       byrow = TRUE)          ### fill matrix by rows
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    5    8   13
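
For comparison, when byrow is left at its default (FALSE) the same data fill the matrix column by column. A short sketch; the name M_col is illustrative.

```r
# Same data, default column-wise filling
M_col <- matrix(c(1, 2, 3, 5, 8, 13), nrow = 2)
M_col
#      [,1] [,2] [,3]
# [1,]    1    3    8
# [2,]    2    5   13

dim(M_col)      # 2 3
dim(t(M_col))   # 3 2: t() transposes the matrix
```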

Basic data structures (5)

Named matrices

  • As with named vectors, labels can be attached to the rows and/or columns of a matrix:
### Generating a named matrix
M <- matrix(data = c(1,2,3,5,8,13), ### the data elements (First Fibonacci numbers)
            ncol = 3,              ### number of columns
            nrow = 2,              ### number of rows
            byrow = TRUE)          ### fill matrix by rows
rn <- c('r1','r2')       ### vector of rownames
cn <- c('c1','c2','c3')  ### vector of colnames
rownames(M) <- rn        ### assign rownames
colnames(M) <- cn        ### assign colnames
M
   c1 c2 c3
r1  1  2  3
r2  5  8 13

Basic data structures (6)

Lists

A list is a collection of objects (numbers, vectors, matrices, etc.). Lists are the most general and flexible data structures in R because they can contain elements of any type (including other lists).

new_list <- list(
  A = matrix(c(4, 1, 1, 8), ncol = 2),
  y = c(1, 2, 6, 6, 9)
)

new_list
$A
     [,1] [,2]
[1,]    4    1
[2,]    1    8

$y
[1] 1 2 6 6 9
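
To illustrate that a list can hold another list, the sketch below adds a nested element. The element name info and its contents are hypothetical.

```r
new_list <- list(
  A = matrix(c(4, 1, 1, 8), ncol = 2),
  y = c(1, 2, 6, 6, 9)
)

# Add a nested list as a third element (hypothetical content)
new_list$info <- list(label = "demo", ok = TRUE)

length(new_list)   # 3
names(new_list)    # "A" "y" "info"
str(new_list, max.level = 1)   # compact overview of the top-level elements
```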

Basic data structures (7)

Data frames

  • A class of objects to represent data matrices.
  • The rows correspond to statistical units (i.e., observations)
  • The columns correspond to variables.

Basic data structures (8)

Data frames

  • R ships with several built-in data sets of class data.frame, for example:
head(iris,n = 10)
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Basic data structures (9)

Data frames

  • NB: Internally, an object of class data.frame is saved as a list whose elements all have the same length and, typically, a name.

  • For example

str(iris)
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
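
The list nature of a data frame can be verified with a few base R calls, sketched here on the built-in iris data:

```r
is.list(iris)     # TRUE
typeof(iris)      # "list"
length(iris)      # 5: the number of columns (list elements)
lengths(iris)     # every column has 150 elements
dim(iris)         # 150 5: rows and columns
```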

Basic data structures (10)

Data frames

  • New data frames are usually created with the data.frame() function.
  • Beware: before R 4.0.0, data.frame()’s default behaviour (stringsAsFactors = TRUE) turned strings into factors. Since R 4.0.0 the default is stringsAsFactors = FALSE.

Little detour: factors

Factors - Definition

  • They are used to represent categorical data and can be either ordinal (e.g., company hierarchies) or non-ordinal (e.g., hair color).
  • A factor MUST be imagined as a vector of integers, where each integer is associated with a label.

Factors: example

Let’s try to create our first factor:

x <- factor(c("yes", "yes", "no"))
x
[1] yes yes no 
Levels: no yes
  • The order in which the levels are represented can be modified using the levels argument of the factor function. By default, the levels are ordered alphabetically.

  • Additionally, if the levels have a hierarchy (e.g., soldier, lieutenant, marshal, etc.), we can indicate this by specifying ordered = TRUE in the factor function.
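
The two bullets above can be sketched as follows, using the military ranks mentioned as an example. The variable name ranks and the data values are illustrative.

```r
ranks <- factor(
  c("lieutenant", "soldier", "marshal", "soldier"),
  levels  = c("soldier", "lieutenant", "marshal"),  # custom, non-alphabetical order
  ordered = TRUE                                    # levels form a hierarchy
)

levels(ranks)        # "soldier" "lieutenant" "marshal"
as.integer(ranks)    # 2 1 3 1: the underlying integer codes
ranks < "marshal"    # TRUE TRUE FALSE TRUE: comparisons respect the ordering
```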

Factors: example

Given a factor, we can use the table function to obtain a table with the levels and frequencies of the variable.

x <- factor(c("A", "B", "A", "B"))
x  # printing the factor
[1] A B A B
Levels: A B
str(x)  # structure of the factor
 Factor w/ 2 levels "A","B": 1 2 1 2
table(x)  # table with levels and frequencies
x
A B 
2 2 

Factors: Warning

Never forget that a factor is nothing more than an integer associated with a label.

x <- factor("a")
y <- factor("b")
c(x, y)  # R >= 4.1.0 returns a factor; earlier versions returned the integer codes 1 1
[1] a b
Levels: a b
  • A very convenient package to work with factors is forcats
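
A classic consequence of the integer representation shows up when a factor holds number-like labels. The sketch below uses a hypothetical factor f.

```r
f <- factor(c("10", "20", "10"))   # hypothetical factor with numeric-looking labels

levels(f)                     # "10" "20"
as.integer(f)                 # 1 2 1: the integer codes, NOT the labels
as.numeric(as.character(f))   # 10 20 10: convert the labels first
```

Converting through as.character() is the safe route whenever the labels, rather than the codes, are the values of interest.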

Basic data structures (11)

Data frames

  • To avoid the problem of converting strings into factors, use stringsAsFactors = FALSE when creating data frames
v1 <- c(10,20,30)                                ### numeric vector
v2 <- c('a','b','c')                             ### character vector
v3 <- c(TRUE,TRUE,FALSE)                         ### logical vector
data.frame(v1, v2, v3, stringsAsFactors = FALSE) ### data.frame
  v1 v2    v3
1 10  a  TRUE
2 20  b  TRUE
3 30  c FALSE
  • A more modern alternative to data.frame is the tibble, from the tibble package.

Subsetting Data Structures (1)

Subsetting Vectors

Values in a vector are retrieved by using the single square bracket [] operator:

s <- c("a"=5, "b"=4, "c"=3, "d"=2, "e"=1)
s[3]
c 
3 

You can also drop elements from a vector with a negative index:

## drop the 3rd element
s[-3]
a b d e 
5 4 2 1 

Subsetting Data Structures (2)

Subsetting Vectors

Out-of-range subsetting produces an ‘NA’ value:

## out-of-range index returns NA
s[10]
<NA> 
  NA 

You can also retrieve more than one element:

indx <- c(2, 3, 5, 5)
s[indx]
b c e e 
4 3 1 1 

Subsetting Data Structures (3)

Subsetting Vectors

You can drop more than one element from a vector by negating a vector of indices:

indx <- c(1, 3)
s[-indx]
b d e 
4 2 1 

You can also retrieve elements with their names:

i_names <- c('d', 'b')
s[i_names]
d b 
2 4 

Subsetting Data Structures (4)

Subsetting Vectors

You can also use logical vectors to retrieve values:

i_logical <- c(FALSE, FALSE, TRUE, FALSE, FALSE)
s[i_logical]
c 
3 

The logical vector will be recycled if it is shorter than the vector to subset:

i <- c(FALSE, TRUE)  ## ->  c(FALSE, TRUE, FALSE, TRUE, FALSE)
s[i]
b d 
4 2 

VERY DANGEROUS BEHAVIOUR: the recycling happens silently, without any warning!
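
One possible safeguard is to check the length of a logical index before using it. The sketch below defines a hypothetical helper, safe_subset(), which is not part of base R or of the slides.

```r
s <- c(a = 5, b = 4, c = 3, d = 2, e = 1)
i <- c(FALSE, TRUE)

length(s[i])   # 2: the short index was recycled with no warning

# Hypothetical helper: refuse logical indices of the wrong length
safe_subset <- function(x, idx) {
  stopifnot(is.logical(idx), length(idx) == length(x))
  x[idx]
}

ok  <- safe_subset(s, s > 2)   # index has the right length: a b c
bad <- tryCatch(safe_subset(s, i), error = function(e) "length mismatch")
bad                            # "length mismatch"
```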

Subsetting Data Structures (5)

Conditional Subsetting

We can also use conditional subsetting:

## select elements greater than 2
i <- s > 2
s[i]
a b c 
5 4 3 
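
Conditions can be combined with & and |, and which() returns the positions where a condition holds. A short sketch on the same vector s:

```r
s <- c(a = 5, b = 4, c = 3, d = 2, e = 1)

mid_vals <- s[s > 2 & s < 5]     # b c -> 4 3
extremes <- s[s == 5 | s == 1]   # a e -> 5 1
pos      <- which(s > 2)         # positions 1 2 3 (named a b c)
nms      <- names(s)[s > 2]      # "a" "b" "c"
```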

Subsetting Data Structures (6)

Subsetting Matrices

Values in a matrix are retrieved by using the [,] operator, placing the row and column dimension before and after the comma:

M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1', 'r2', 'r3')
colnames(M) <- c('c1', 'c2', 'c3', 'c4')
M ## print the full matrix
   c1 c2 c3 c4
r1  1  2  3  4
r2  5  6  7  8
r3  9 10 11 12

Subsetting Data Structures (7)

Subsetting Matrices

M[2, 3]
[1] 7

You can also retrieve an entire row or column:

M[1, ]
c1 c2 c3 c4 
 1  2  3  4 
M[, 1]
r1 r2 r3 
 1  5  9 
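
Note that extracting a single row or column returns a plain vector. If a matrix is needed, the drop = FALSE argument keeps both dimensions, as in this sketch (row_vec and row_mat are illustrative names):

```r
M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1', 'r2', 'r3')
colnames(M) <- c('c1', 'c2', 'c3', 'c4')

row_vec <- M[1, ]                 # named vector, dimensions dropped
row_mat <- M[1, , drop = FALSE]   # still a 1 x 4 matrix

is.matrix(row_vec)   # FALSE
dim(row_mat)         # 1 4
```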

Subsetting Data Structures (8)

Subsetting Matrices

Retrieving more than one column/row:

i <- c(2, 3)
M[i, ]
   c1 c2 c3 c4
r2  5  6  7  8
r3  9 10 11 12
M[c(1, 3), c(2, 4)]
   c2 c4
r1  2  4
r3 10 12

Subsetting Data Structures (9)

Subsetting Matrices

We can use names of columns and rows:

i <- c('r1', 'r3')
M[i, ]
   c1 c2 c3 c4
r1  1  2  3  4
r3  9 10 11 12
i <- c('c2', 'c4')
M[, i]
   c2 c4
r1  2  4
r2  6  8
r3 10 12
i <- c('c2', 'c4')
M[3, i]
c2 c4 
10 12 

Subsetting Data Structures (10)

Subsetting Matrices

We can also use logical vectors:

i <- c(TRUE, FALSE, FALSE)
M[i, ]
c1 c2 c3 c4 
 1  2  3  4 
i <- c(TRUE, FALSE)  ## ->  c(TRUE, FALSE, TRUE)
M[i, ]
   c1 c2 c3 c4
r1  1  2  3  4
r3  9 10 11 12
i <- M[, 'c3'] < 2 * M[, 'c1']
M[i, 'c4']
r2 r3 
 8 12 
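
A logical matrix of the same shape can also be used as an index. In that case the result is a plain vector, read column by column. A minimal sketch; the name big is illustrative.

```r
M <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
rownames(M) <- c('r1', 'r2', 'r3')
colnames(M) <- c('c1', 'c2', 'c3', 'c4')

big <- M[M > 6]   # elements greater than 6, in column-major order
big               # 9 10 7 11 8 12

# Row and column positions of the selected elements
which(M > 6, arr.ind = TRUE)
```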

Subsetting Data Structures (10)

Subsetting Lists

Subsetting lists works slightly differently from subsetting vectors. In particular, single brackets return a list:

new_list[2]
$y
[1] 1 2 6 6 9
str(new_list[2]) # A LIST containing only the second element
List of 1
 $ y: num [1:5] 1 2 6 6 9

But double brackets return the element itself:

new_list[[2]]
[1] 1 2 6 6 9
str(new_list[[2]]) # the second element of the list
 num [1:5] 1 2 6 6 9

Subsetting Data Structures (11)

Subsetting Named lists

We can extract elements by name using $:

new_list$A
     [,1] [,2]
[1,]    4    1
[2,]    1    8

Or, equivalently, using double brackets with the name as a string:

new_list[["A"]]
     [,1] [,2]
[1,]    4    1
[2,]    1    8
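
One practical difference between the two forms: double brackets accept a name stored in a variable, while $ takes the name literally. A sketch with a hypothetical variable key:

```r
new_list <- list(
  A = matrix(c(4, 1, 1, 8), ncol = 2),
  y = c(1, 2, 6, 6, 9)
)

key <- "y"            # hypothetical variable holding an element name

new_list[[key]]       # 1 2 6 6 9: the variable is evaluated
new_list$key          # NULL: looks for an element literally named "key"
```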

Subsetting Data Structures (12)

Subsetting Data Frames

Since data frames are lists, one or more columns can be extracted with the usual list syntax:

head(iris$Sepal.Length) # output: vector
[1] 5.1 4.9 4.7 4.6 5.0 5.4
head(iris[["Sepal.Length"]]) # output: vector
[1] 5.1 4.9 4.7 4.6 5.0 5.4
head(iris[[1]]) # output: vector
[1] 5.1 4.9 4.7 4.6 5.0 5.4
head(iris[1]) # output: data frame with 1 column named Sepal.Length
  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0
6          5.4
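
Several columns can be selected at once by passing a vector of names to single brackets. A short sketch; the name two_cols is illustrative.

```r
two_cols <- iris[c("Sepal.Length", "Species")]

class(two_cols)   # "data.frame"
dim(two_cols)     # 150 2
head(two_cols, 3)
#   Sepal.Length Species
# 1          5.1  setosa
# 2          4.9  setosa
# 3          4.7  setosa
```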

Subsetting Data Structures (13)

Subsetting Data Frames

In addition, data frames can be subset with the usual matrix syntax:

iris[1, ] # the first row
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
head(iris[, 1]) # the first column (as a numeric vector)
[1] 5.1 4.9 4.7 4.6 5.0 5.4
iris[1, 3] # the value in the first row and the third column
[1] 1.4
head(iris[, 1, drop = FALSE], n = 3) # the first column (as a one-column data frame)
  Sepal.Length
1          5.1
2          4.9
3          4.7
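
The matrix syntax combines naturally with conditional subsetting to filter rows. The sketch below keeps the setosa flowers with a sepal length above 5.5; the threshold and the names big_setosa and same are arbitrary choices for illustration.

```r
# Logical condition in the row position, all columns kept
big_setosa <- iris[iris$Species == "setosa" & iris$Sepal.Length > 5.5, ]

# Base R's subset() expresses the same filter more compactly
same <- subset(iris, Species == "setosa" & Sepal.Length > 5.5)

nrow(big_setosa) == nrow(same)   # TRUE: both select the same rows
```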